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Abstract 

Background: Secondary use of large scale administrative data is increasingly popular in health services and clinical 
research, where a user-friendly tool for data management is in great demand. MapReduce technology such as 
Hadoop is a promising tool for this purpose, though its use has been limited by the lack of user-friendly functions 
for transforming large scale data into wide table format, where each subject is represented by one row, for use in 
health services and clinical research. Since the original specification of Pig provides very few functions for column 
field management, we have developed a novel system called GroupFilterFormat to handle the definition of field and 
data content based on a Pig Latin script. We have also developed, as an open-source project, several user-defined 
functions to transform the table format using GroupFilterFormat and to deal with processing that considers date 
conditions. 

Results: Having prepared dummy discharge summary data for 2.3 million inpatients and medical activity log data 
for 950 million events, we used the Elastic Compute Cloud environment provided by Amazon Inc. to execute 
processing speed and scaling benchmarks. In the speed benchmark test, the response time was significantly 
reduced and a linear relationship was observed between the quantity of data and processing time in both a small 
and a very large dataset. The scaling benchmark test showed clear scalability. In our system, doubling the number 
of nodes resulted in a 47% decrease in processing time. 

Conclusions: Our newly developed system is widely accessible as an open resource. This system is very simple and 
easy to use for researchers who are accustomed to using declarative command syntax for commercial statistical 
software and Structured Query Language. Although our system needs further sophistication to allow more flexibility 
in scripts and to improve efficiency in data processing, it shows promise in facilitating the application of 
MapReduce technology to efficient data processing with large scale administrative data in health services and 
clinical research. 
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Background 

Secondary large scale data such as nation-wide adminis- 
trative data are increasingly utilized in clinical and health 
service research for timely outcomes studies in real 
world settings [1-5]. This trend has further been fueled 
by recent improvements in informatics technology for 
handling ultra large volumes of on-site data through 
work parallelization and cloud computing. For example, 
the launch of the Sentinel System by the US Food and 
Drug Administration in 2008 aimed at establishing an 
active surveillance system for monitoring drug safety in 
real-time, using electronic data from multiple healthcare 
information holders. In Japan, the Diagnosis Procedure 
Combination (DPC) inpatient database survey has col- 
lected nationwide administrative data since 2003 [6], and 
several epidemiological studies have been based on this 
inpatient database [7-12]. 

Administrative data including time (e.g., day, hour), 
procedure, or episode (e.g., hospitalization or visit), are 
often presented in long table format. However, prevalent 
statistical software for epidemiological analyses prefers 
datasets prepared in wide table format with each indivi- 
dual record corresponding to one row. Extracting data 
from different sources requires linkage of data with 



multiple unique patient identifiers, and complicated 
steps for data merge and transformation (Figure 1). Fur- 
thermore, it is often necessary in epidemiological studies 
to calculate the time interval between different events 
recorded in different rows, and then to transform these 
data into a wide table column. Suppose one wished to 
know whether administration of antibiotics within three 
days after surgery reduced the chance of postsurgical in- 
fection. Then, the time interval between the first and last 
dates of antibiotic administration would need to be cal- 
culated and queried. 

Data processing and transformation as described above 
can be simply handled using existing Structured Query 
Language (SQL) technology [13] with a moderate sized 
dataset. However, treatment of ultra large databases such 
as the nationwide administrative data is beyond the sca- 
ling capacity of SQL, and instead parallelized environ- 
ments are required. Open source MapReduce technology 
such as Hadoop has become popular as a software frame- 
work for large scale distributed data analysis [14,15]. The 
usefulness of Hadoop, however, is still limited. Many epi- 
demiologists, health service researchers, and health policy 
analysts who are familiar with the declarative style of 
existing statistical software and SQL commands do not 
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Figure 1 Data transformation from log data to a wide table. 
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use Hadoop because MapReduce programming is too 
rigid and difficult for these end users to write procedural 
code. Also, these researchers often need trial- and- error 
ad-hoc analysis for data description and planning optimal 
analytic strategy. For such purposes, we need a more user- 
friendly framework that allows iterative, easy, and quick 
transformation of ultra-large scale administrative data into 
an analytic dataset. Recent development of the Pig Latin 
language is promising for filling this gap between proce- 
dural programmers and data users, and allows user- 
friendly use of MapReduce technology, though its use in 
the bioinformatics arena is still limited [16]. 

Given the background and incentives above, we have 
developed several user-defined functions (UDF) to 
process large scale administrative data for ease of epi- 
demiological analysis, based on a Pig Latin script in the 
Hadoop framework [17]. The developed script is very easy 
to use even for researchers who are accustomed to using 
declarative command syntax for statistical software and 
SQL. The developed functions were tested with a large 
claims database for response speed and scalability and, as 
presented below, achieved fairly satisfying results. 

Implementation 

Referred technologies 

Map and Reduce are common to many functional 
programming languages such as Lisp and Scheme. 
Google recently popularized the use of Map and Reduce 
as a simpler solution for parallelizing computation 
[18] for a certain subset of problems compared to other 
approaches such as Parallel Virtual Machines [19] 



(Figure 2). One major benefit of the MapReduce ap- 
proach is the ability to focus solely on the computation, 
and not the shuffling of data between processors. 
The programmer only needs to consider the computa- 
tion itself and can assume that the data will be available 
as required. This allows users with some programming 
experience to create and run jobs without extensive 
training in parallel computing. The second major bene- 
fit of MapReduce concerns data locality. With the 
MapReduce paradigm, most of the computation is done 
on a slave node, which contains a copy of the input 
data. This requires the minimal amount of data being 
sent over the network, resulting in increased overall 
efficiency. 

Hadoop is an open-source implementation of the 
MapReduce parallel programming paradigm and is sup- 
ported by the open-source community. Hadoop provides 
both the MapReduce parallel computation framework 
and a distributed file system (called the Hadoop Distri- 
buted File System, HDFS). Hadoop, which is an Apache 
Foundation project written in Java, provides a master- 
slave architecture where a single master node coordi- 
nates many slave machines, which carry out data storage 
and the actual computation. To enable data-local pro- 
cessing, each slave machine tries to use only data stored 
on the same machine for computation. This requires 
very little shuffling of data over the network, resulting in 
decreased demand for network I/O bandwidth. Ad- 
ditional slave nodes can be added to the cluster to in- 
crease HDFS storage capacity and computational power 
as necessary. 
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Finally, Apache Pig is a platform for analyzing large 
datasets and consists of a high-level language for expres- 
sing data analysis programs, coupled with the infrastruc- 
ture for evaluating these programs. The salient property 
of Pig programs is that their structure is amenable to 
substantial parallelization, which in turn enables them to 
handle very large datasets [17,20]. Pigs infrastructure 
layer consists of a compiler on the user s client machine 
that transforms Pig Latin programs into sequences of 
MapReduce programs that run in parallel on the nodes 
of a Hadoop cluster. Pig is a Java client side application 
installed locally by users, and thus nothing is altered on 
the Hadoop cluster itself [16]. 

Despite its innovativeness and broader feasibility, the 
original Pig Latin has several drawbacks. Firstly, it has 
only a "join" script to transform log data into a wide 
table format, in which the data are compiled for one 
field at a time. This prevents processing multiple fields 
in parallel, resulting in slow processing and inefficient 
script formation. Secondly, Pig includes only very poor 
functions for date processing. It is quite cumbersome to 
use Pig for filtering data by date or calculating day inter- 
vals between events. These drawbacks must be overcome 
to make Pig based scripts suitable for epidemiological 
studies. 

UDFs for data transformation into a table format 

Transformation of long-shaped log data into a wide table 
format requires management of the column field 
scheme, e.g., assignment of column names and field 
locations, and definition of the data content. Since the 
original Pig has very limited functions for column field 
management, we newly developed GroupFilterFormat to 
handle the definition of field and data content. GroupFil- 
terFormat also provides information linkage between dif- 
ferent code systems, and generates new categories and 
values. For example, pharmaceutical codes (such as Uni- 
versal Product Numbers in the United States or Japanese 
Article Numbers for pharmaceuticals in Japan) by pro- 
duct may be cumbersome to handle, and one may wish 
to categorize them into larger groups of pharmaceutic- 
ally equivalent products (according to their generic 
name). Furthermore, suppose pharmaceutical codes cor- 
respond not only to the types of medication, but also to 
the dose of the medication. GroupFilterFormat defines 
which pharmaceutical code should be categorized into a 
new larger category, and attaches the numeric dose in- 
formation to the code. 

The input format for GroupFilterFormat is as follows: 
'groupname (iteml [valuel, value2 . . .], item2 ... ), ... 
'where groupname = a new group name corresponding 
to a new field in wide format, item = the original code 
per item, and value = a numeric value attached to each 
item. 



In the Map phase, Exists filters the data by excluding 
data not defined by GroupFilterFormat, and reduces the 
data volume to improve the efficiency of data processing. 
In the Reduce phase, InnerGroup transforms the data 
from long to wide format allowing a row observation to 
correspond to each observed unit, e.g., patient or admis- 
sion event. The original Pig has no functions for column 
management. Developed data in a table format are 
prepared for numeric processing. Value-Join provides 
quantitative values linked with qualitative categorical 
information as defined in GroupFilterFormat for further 
numeric processing. For numeric calculation, the 
calculation functions originally available in Pig can be 
used (e.g., COUNT, SUM, MAX, and so on) (Figure 3). 

UDFs for the management of date data 

GetDaySpan calculates a day interval between two dates. 
AddDaySpan adds an n day interval to a date to obtain 
the date after the interval. These two UDFs are useful 
for calculating age and event intervals. PickupSequence- 
Values filters data observed consecutively for a period 
starting from an assigned date. This UDF is useful for 
extracting log data of pharmaceutical administration 
repeated over a period (Figure 4). 

Benchmark environment and test dataset 

Time efficiency is an important issue in data manage- 
ment. The main goal of this study was to provide 
researchers with open-source, time- efficient software for 
handling large scale administrative data. Existing me- 
thods designed to handle small datasets would require a 
vast amount of time to process a large dataset. This is a 
serious problem because it may hinder researchers in 
carrying out large-data studies. We developed our soft- 
ware to solve this problem and contribute to the en- 
hancement of research using a large administrative 
database. Consequently, we evaluated the performance 
of the software mainly in terms of time efficiency and 
scalability. 

The Elastic Compute Cloud (EC2) infrastructure ser- 
vice from Amazon was used as a test bed for the per- 
formance evaluation. We adopted a Large Instance 
provided by Amazon EC2 with the following configu- 
ration: 7.5 GB memory, 4 EC2 Compute Units (2 virtual 
cores with 2 EC2 Compute Units each), 850 GB instance 
storage, and a 64-bit platform. 

In this benchmarking test, we created dummy admin- 
istrative data for in-hospital services containing patient 
discharge summary data and medical activity logs for 20 
different kinds of medications. We prepared discharge 
summary data for 2.3 million inpatients and medical ac- 
tivity log data for 950 million events. The Input and 
Output data image is as shown in Additional file 1: 
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Script 

Datal = Filter Data by Exists($l) (#1) 
RELATION_A = group Datal by $0; (#2) 
RELATION_B = foreach RELATION_A generate group, 

lnnerGroup(DATA, DATA.$1); (#3) 
OUTPUT = foreach RELATION_B generate group, $1.$0, 

COUNT($l.$l), $1.$2, COUNT($l.$3), $1.$4, COUNT($l.$5); 



Map Phase 



Exists($l) 

Exists filters the data by excluding data not 
defined by GroupFilterFormat 



GroupFileFormat 

Medicine_a(aaaa,aaab), Medicine_b(bbbb), Medicine_c(ccca,cccb) 



OUTPUT 

(0000, medicine_a, 2, medicine_b, 2, medicine_c, 3) 
(1111, medicine_a, 2, medicine_b, 0, medicine_c, 0) 
(2222, medicine_a, 0, medicine_b, 1, medicine_c, 0) 



Reduce Phase 




(0000,aaaa) 
(0000,aaab) 



(0000,bbbb) 



(llll,aaaa) 
(llll,aaab) 



(2222,bbbb) 



#3 InnerGroup ($1) 

InnerGrouptransforms the data from long to wide 
format allowing a row observation to correspond to 
each observed unit. 



Figure 3 Step-by-step example of data transformation. $/V indicates the Nth column in the data field. 



Appendix 1 while the program script used for the 
benchmark test is given in Additional file 2: Appendix 2. 

We created a Hadoop cluster on Amazon EC2, com- 
posed of one master for the master name and job tracker 
node, and varying numbers of slave nodes for task 
tracker and data nodes. For the processing speed bench- 
mark, we used varying sized subsamples of the bench- 
mark test data, that is, 1/1 sample (23 million patients), 
1/2 sample (11.5 million patients), 1/4 sample (5.7 mil- 
lion patients), and 1/8 sample (2.6 million patients), and 
ran the same script 20 times with each subsample to 



measure the processing time using one master node and 
4 slave nodes. For the scaling benchmark, we used the 
entire sample data, and ran the same script 20 times 
using one master node and 2 slave nodes. Then we 
doubled the number of slave nodes until 48 nodes were 
used, repeatedly measuring the processing time. 

Results and discussion 

Table 1 and Figure 5 present the results of the proces- 
sing speed benchmark using regression in a linear 
model, (time(sec)) = 0.0015*(record)+155.22 R2=0.9998. 



Script 

RELATION_A = group Data by $0; (#1) 
OUTPUT = foreach RELATION_A generate $0, 

PickupSequenceValues($0, $0.$2, MIN($0.$2)); (#2) 



(0000,20071111,2871, ... ) 
(0000,20071112,2872, ... ) 
(1111,20101220,4006, ... ) 
(0000,20071113,2873, ... ) 



(0000,20071113,2873, 
(0000,20071119,2879, ... ) 
(1111,20101209,3995, ... ) 
(1111,20101221,4007, ... ) 



OUTPUT 

(0000,{({(0000,20071111,2871, ... )}),({(0000,20071112,2872, ... )}),({(0000, 
20071113,2873, ... ),(0000,20071113,2873, ... )})}) 
(1111,{({(1111,20101209,3995, ... )})}) 




(0000,20071111,2871, . 
(0000,20071112,2872, . 
(0000,20071113,2873, . 
(0000,20071113,2873, . 
(0000,20071119,2879, . 



(1111,20101209,3995, . 
(1111,20101221,4006, . 
(1111,20101220,4007, . 



grou 
P 


MIN($0.$2) 
= 2871 


MIN($0.$2) 
+1=2872 


MIN($0.$2) 
+2=2873 


MIN($0.$2) 
+3=2874 


0000 


(0000,2007 
1111,2871) 


(0000,2007 
1112,2872) 


(0000,2007 
1113,2873) 
(0000,2007 
1113,2873) 


null 



group 


MIN($0.$2) 
= 3995 


MIN($0.$2) 
+1 = 3996 


1111 


(1111,2010 
1209,3995) 


null 



PickupSequenceValues($0, $0.$2, MIN($0.$2)) 
PickupSequenceValue^ilters data observed consecutively 
for a period starting from an assigned date . 



Figure 4 Step-by-step example of date management. $/V indicates the Nth column in the data field. 
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Table 1 Results of the processing speed benchmark 



Number of 
records 


Average processing 
time (s) 


Max 
time (s) 


Min 
time (s) 


261,369 


554.5976 


565.765 


545.513 


569,738 


986.061 1 


1000.902 


971.052 


1,150,684 


1911.162 


1932.173 


1890.286 


2,301,367 


3616.403 


3,673.40 


3,598.07 



As shown in the graph, there is a clear linear relationship 
between the processing time and data size. The intercept 
of the model is significant, at 155 seconds, which should 
correspond to the lead time for batch processing. 

Table 2 and Figure 6 give the results of the scaling 
benchmark using regression in a power model, (time 
(sec)) = 13132(number of node)" 0921 R2=0.9989. The 
intercept of the model is significant and this result 
shows that using double the number of nodes reduces 
processing time by 47%. 

The motivation for developing the current system was 
to simplify the use of large scale administrative databases 
in epidemiological and health service research, and for 
policy evaluation. We believe the developed system will 
be useful and will contribute to the above goal for the 
following reasons. 

Firstly, the developed system achieves satisfying scaling 
for conversion of a large scale dataset for parallelization 
with Hadoop. Because of the overhead of managing each 
node, adding additional nodes yields a diminished vo- 
lume of transactions, but retains adequate scaling ability. 
Processing the 950 million log entries for administrative 
activities in the performance test took one hour at a cost 
of 10 US dollars using a parallel environment such as 
the Amazon system with one master and eight slave 
nodes. To complete the task in 10 minutes requires one 
master plus 48 slave nodes at a cost of about 50 US 



Table 2 Results of the scalability benchmark 



Number of 
slave nodes 


Average processing 
time (s) 


Max 
time (s) 


Min 
time (s) 


2 


6,892.868 


6,986.503 


6,844.374 


4 


3,616.403 


3,673.398 


3,598.065 


8 


2,063.208 


2,087.145 


2,037.378 


12 


1,301.092 


1,326.391 


1,280.319 


16 


1,022.917 


1,133.464 


985.958 


24 


677.832 


690.765 


670.458 


48 


379.049 


401.013 


370.314 



dollars. Thus, the system allows users to choose between 
the tradeoff of time response and cost. 

Secondly, the current system uses only free and open- 
source components. It uses the Hadoop framework for 
distributed data processing, and the Pig Latin language 
for script development. These are free-share open- 
source software (OSS) products under the Apache Li- 
cense. System development of the UDFs in the current 
study is also an OSS project, freely available for use and 
alteration. Furthermore, Pig can execute the same script 
on local computers even without Hadoop. Durok is a 
subproduct that has developed from this use of Pig, 
allowing the currently developed system to be used as 
original application software. At present, Durok is open- 
source software available under an Apache License; 
however, it is not an official project of the Apache Foun- 
dation. The Durok system can be applied to small data- 
sets that can be processed without a distributed data 
processing environment. 

Finally, the system achieved quick response in proces- 
sing the large administrative database to allow conve- 
nient ad-hoc analysis in a trial-and-error fashion. Quick 
and easy access to large databases allows researchers and 
analysts broader opportunities for investigating innovative 
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Figure 5 Processing speed benchmark. Dots indicate the average 
processing time for 20 trials. The line indicates the prediction 
equation fitted with a linear regression. 
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Figure 6 Scaling benchmark. Dots indicate the average processing 

time for 20 trials. The line indicates the prediction equation fitted 

with a power regression. 
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research questions, generating hypotheses to be tested in 
formal research, and ad-hoc monitoring of adverse events. 

The current system still needs further development of 
the UDFs to allow more complicated data transform- 
ation with simpler scripts. Currently, the proposed UDFs 
are functionally separated into grouping and date func- 
tions owing to restrictions in the format design of 
GroupFilterFormat. However, users may wish to identify 
patterns in timing and types of administered pharmaceu- 
ticals through data mining to find best practice patterns 
in a real setting. To satisfy such requirements, the for- 
mat design needs further development to allow flexibility 
in setting a reference time point in GroupFilterFormat. 

We believe that the present system is generalizable to 
any large scale administrative database which has a simi- 
lar data format to the DPC data. Another challenge is to 
further improve efficiency in data processing with 
increased data sizes. The Reduce process is a limiting 
factor in improving the speed of data processing. Cur- 
rently the proposed scheme needs two iterations of the 
Reduce step to transform a table (by Innergroup) and 
numeric calculation. How to decrease the number of Re- 
duce processes will be the key to achieving further 
speedup. This may be possible by developing original 
UDFs for numeric processing, or by reordering data pro- 
cessing to avoid the second Reduce step. 

Conclusions 

Using a MapReduce program with a Pig Latin-based 
script, we developed a tool to transform ultra large ad- 
ministrative data into a wide table format. This tool is 
very simple and easy to use by researchers, and shows 
promise in applying MapReduce technology to efficient 
data processing in health services and clinical research 
with large scale administrative data. 

Availability and requirements 

• Project name: The University of Tokyo DPC project 

• Project home page: http://github.com/hiromasah/ 
charsiu 

• Operating system(s): Platform independent 

• Programming language: Java 

• Other requirements: Java 1.3.1 or higher, Hadoop 
1.0.0, Pig 0.9.2 

• License: Apache 

• Any restrictions to use by non-academics: This is 
an open source project. 

Additional files 



Additional file 1: Appendix 1. Dataset format. 
Additional file 2: Appendix 2. Pig script. 
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